Unsupervised Learning has been gaining momentum as Recommendation Systems have grown in popularity. These algorithms tend to appear in more advanced use cases, but understanding them is key to becoming an effective Data Scientist or ML Engineer. This tutorial will cover the following learning objectives:
K-Means Clustering
Hierarchical Clustering
Principal Component Analysis (PCA)
K-Means Clustering
Summary
K-Means is the most common algorithm for conducting cluster analysis. "k" represents the number of clusters specified. This algorithm allows you to input a set of Training Data, specify a number of clusters, and use the algorithm to cluster the data points by the means of the features provided.
K-Means is considered "Unsupervised" because you are allowing the algorithm to assign the Training Data to specific classes rather than telling it which class each data point is assigned to.
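As a rough illustration, here is a minimal sketch of running K-Means with SciKit-Learn; the small feature matrix X is made up purely for the example:

# A minimal K-Means sketch using SciKit-Learn (illustrative data only).
from sklearn.cluster import KMeans
import numpy as np

# Each row is a data point, each column a feature.
X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0],
              [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]])

# "k" is the number of clusters we ask the algorithm to find.
kmeans = KMeans(n_clusters=2, random_state=42, n_init=10)
kmeans.fit(X)

print(kmeans.labels_)           # the cluster each point was assigned to
print(kmeans.cluster_centers_)  # the final Centroid of each cluster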
To perform a cluster analysis with the K-Means algorithm, use the following steps (a from-scratch sketch of these steps follows the list):
Define the number of clusters
Set cluster centers randomly
The center for each cluster is known as a Centroid
Assign points to clusters
The algorithm then measures the distance from each data point to each Centroid. Each data point is assigned to the cluster whose Centroid is closest.
Calculate the center of each cluster
Once the initial clusters have been created, each Centroid is recalculated as the mean (center) of the data points assigned to its cluster.
Assign points to the new clusters
With the new Centroids created, we'll now repeat step 3 by assigning each data point to a cluster based on the Centroid closest to it.
Repeat Steps 4 and 5 until convergence is reached
Convergence is reached when the cluster assignments stop changing, that is, when the Centroids no longer move between iterations.
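To make the steps above concrete, here is a simplified from-scratch sketch of the loop in NumPy (a toy illustration that skips edge cases such as empty clusters):

import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: set the cluster centers (Centroids) randomly by picking k points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Steps 3/5: assign each point to its closest Centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recalculate each Centroid as the mean of its cluster.
        # (Empty clusters are not handled in this simplified sketch.)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop once the Centroids no longer move (convergence).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids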
When using pre-built K-Means tools such as RapidMiner or SciKit-Learn, the initial Centroids are chosen with smarter initialization schemes (SciKit-Learn defaults to k-means++) to reduce the bias a purely random start can introduce.
The Elbow Method calculates the summed distance between the points and their Centroids, re-running the algorithm with one more cluster on each iteration until it reaches the maximum number of clusters specified. With each iteration, the summed distance gets smaller. When analyzing the results of the Elbow Method, you look for where the improvement starts to plateau; that point, the "elbow", is considered the optimal number of clusters.
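A rough sketch of the Elbow Method with SciKit-Learn, using the model's inertia_ attribute (the summed squared distance between points and their Centroids) and the toy X matrix from the earlier example:

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

inertias = []
max_k = 10  # the maximum number of clusters to try
for k in range(1, max_k + 1):
    model = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    inertias.append(model.inertia_)

# Plot k against inertia and look for the "elbow" where the curve flattens.
plt.plot(range(1, max_k + 1), inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Summed distance to Centroid (inertia)")
plt.show()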
NOTE: K-Means is best used for creating classes for use in training a Decision Tree. For example, say you want to classify customers into three classes: Target Market, Common, and Less Valuable. You can use K-Means to cluster the customers based on their features and then use a Decision Tree Classifier and/or Random Forest to find which features best predict the classes.
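For example, a sketch of that workflow using made-up customer data: use K-Means to label the customers, then train a Decision Tree to see which features predict those labels:

from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier
import numpy as np

# Hypothetical customer features, e.g. [annual_spend, visits_per_month].
customers = np.array([[5200, 12], [300, 1], [4800, 10],
                      [150, 2], [2500, 6], [2700, 5]])

# Cluster the customers into three classes (e.g. Target Market, Common, Less Valuable).
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(customers)

# Train a Decision Tree on the cluster labels to see which features drive them.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(customers, labels)
print(tree.feature_importances_)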
Hierarchical Clustering
Summary
Hierarchical Clustering is a technique for grouping data points into nested tiers of clusters. These tiers become more specific as you move down the hierarchy.
There are two types of Hierarchical Clustering: Agglomerative and Divisive. Divisive Clustering takes a top-down approach: all data points begin in the same cluster and are then broken down into sub-clusters based on similar features. Agglomerative Clustering takes a bottom-up approach: you start with individual data points and progressively merge them into larger groups based on similar features.
Agglomerative Clustering starts with each data point representing its own cluster. Then, using Euclidean distance (the same formula K-Means uses to find the distance between a data point and a Centroid), the two closest clusters are merged into a new cluster. This process is repeated until one massive cluster remains.
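A minimal Agglomerative Clustering sketch with SciKit-Learn, reusing the toy X matrix from the K-Means example:

from sklearn.cluster import AgglomerativeClustering

# Bottom-up clustering: each point starts as its own cluster and the
# closest clusters are merged until only n_clusters remain.
agg = AgglomerativeClustering(n_clusters=2, linkage="ward")
print(agg.fit_predict(X))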
A Dendrogram is the visualization of Hierarchical Clusters. It lets you map how each data point relates to the others and which features separate each cluster. The distance between links in the Dendrogram shows how similar or different two data points are: the smaller that distance, the more closely they are related.
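One common way to draw the Dendrogram itself is SciPy's hierarchy module (again using the toy X matrix), roughly as follows:

from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Build the full merge history (here with Ward linkage), then plot it.
merge_history = linkage(X, method="ward")
dendrogram(merge_history)
plt.ylabel("Distance between merged clusters")
plt.show()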
NOTE: Hierarchical Clustering is very computationally expensive: as the number of data points (and therefore the number of potential merges) grows, the computing power required grows much faster than linearly, so it does not scale well to large datasets.
Principal Component Analysis (PCA)
Summary
Principal Component Analysis, which is closely related to Exploratory Factor Analysis, is a statistical method for identifying structure in your Training Data. It is useful when you have many features to consider and want to investigate the correlations between them; these correlations are the basis of PCA.
PCA is used to group features that are most highly correlated. This is used to create a sort of "cause and effect" diagram to visualize how one feature impacts another.
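A minimal PCA sketch with SciKit-Learn; the four-feature dataset below is randomly generated just for illustration, and the components_ matrix shows how the original features group onto each component:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Hypothetical data: three highly correlated features plus one unrelated feature.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 1))
data = np.hstack([base + rng.normal(scale=0.1, size=(100, 1)) for _ in range(3)]
                 + [rng.normal(size=(100, 1))])

# Standardize first so each feature contributes on the same scale.
scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=2).fit(scaled)
print(pca.components_)               # loadings: how features group onto components
print(pca.explained_variance_ratio_) # share of total variance each component explains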
In PCA, a factor is a hidden variable that affects several other variables. For example, someone who smokes could have lung cancer. In this example, the factor is the smoking and the effect is the lung cancer.
Just like finding the optimal number of clusters in K-Means, the Scree Test lets you create a descending line chart that shows the number of factors used in the iteration on the x-axis and a measure of effectiveness on the y-axis. However, when interpreting the "Scree plot", you choose the largest number of factors whose eigenvalue is still greater than 1 as the optimal number of factors to include in the algorithm.
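A sketch of a Scree Test using the eigenvalues from a full PCA fit on the scaled data above; under the eigenvalue-greater-than-1 rule, the factors above the dashed line are kept:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

full_pca = PCA().fit(scaled)
eigenvalues = full_pca.explained_variance_  # one eigenvalue per candidate factor

plt.plot(range(1, len(eigenvalues) + 1), eigenvalues, marker="o")
plt.axhline(1.0, linestyle="--")  # keep factors with an eigenvalue greater than 1
plt.xlabel("Number of factors")
plt.ylabel("Eigenvalue")
plt.show()

n_factors = (eigenvalues > 1).sum()
print(n_factors)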
A Factor Loading measures how strongly each feature is correlated with its assigned factor.
The Eigenvalue measures how much of the variance across all variables a given number of factors explains. The reason you don't simply choose the number of factors with the highest total eigenvalue is that you want a balance between an evenly distributed factor structure and a highly explained variance.
NOTE: When developing a PCA in a pre-built tool such as RapidMiner or SciKit-Learn, a rotated component matrix (for example, a varimax rotation) can be applied to the loadings to make the results easier to analyze and interpret.
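In SciKit-Learn specifically, one way to get rotated loadings is the FactorAnalysis estimator, which accepts a varimax rotation; a hedged sketch, reusing the scaled data from above:

from sklearn.decomposition import FactorAnalysis

# Fit a two-factor model and apply a varimax rotation to make the
# loadings easier to interpret.
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(scaled)
print(fa.components_)  # rotated factor loadings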